The "warm up" challenge for this year is adapted from the well-known 'Wine Quality' challenge on Kaggle. In particular, given a dataset containing several attributes describing wine, your task is to make predictions on the quality of as-yet unseen wine samples. Developing a model which accurately fits the available training data while also generalising to unseen data-points is a multi-faceted challenge that involves a mixture of data exploration, pre-processing, model selection, and performance evaluation.
IMPORTANT: please refer to the AML course guidelines concerning grading rules. Pay special attention to the presentation quality item, which boils down to: don't dump a zillion lines of code and plots in this notebook. Produce a concise summary of your findings: this notebook can exist in two versions, a "scratch" version that you will use to work and debug, and a "presentation" version that you will submit. The "presentation" notebook should go to the point and convey the main findings of your work.
Beyond simply producing a well-performing model for making predictions, in this challenge we would like you to start developing your skills as a machine learning scientist. In this regard, your notebook should be structured in such a way as to explore the following tasks, which are expected to be carried out whenever undertaking such a project. The description below each aspect should serve as a guide for your work, but you can also explore alternative options and directions. Thinking outside the box will be rewarded in these challenges.
You will be working on two data files, which will be available in /mnt/datasets/wine/, one for red and one for white wines:
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference Cortez et al., 2009. Only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
A possible trick is to set an arbitrary cutoff for your dependent variable (wine quality) at e.g. 7 or higher getting classified as 'good/1' and the remainder as 'not good/0'. Note that this can be seen as a data preparation task.
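As a sketch of this cutoff trick (the threshold of 7 and the column name `quality` come from the dataset description; the frame below is a toy stand-in):

```python
import pandas as pd

# Toy frame standing in for the real dataset; only the 'quality' column matters here.
df = pd.DataFrame({"quality": [3, 5, 6, 7, 8]})

# Quality >= 7 becomes 'good' (1), the remainder 'not good' (0).
df["good"] = (df["quality"] >= 7).astype(int)
print(df["good"].tolist())  # [0, 0, 0, 1, 1]
```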
We leave it to the students to decide how to carve out training and test sets (validation sets too, if relevant to your approach). This is not a competition in which the instructors hold a "private" test set to rank students' models.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Input variables (based on physicochemical tests): 1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Output variable (based on sensory data): 12 - quality (score between 0 and 10)
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVR, SVC
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split, GridSearchCV
from google.colab import files
uploaded = files.upload()
red = pd.read_csv("winequality-red.csv", sep=';')
white = pd.read_csv("winequality-white.csv", sep=';')
print("Import complete")
Data exploration: The first broad component of your work should enable you to familiarise yourselves with the given data, an outline of which is given at the end of this challenge specification. Among others, you can work on:
# Red wine
red.head()      # all features are numerical
red.describe()  # summary statistics
red.info()      # no missing values
sns.pairplot(red)
No single feature shows a clear correlation with quality; alcohol and density look like the most promising candidates.
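The visual impression can be checked numerically. On the actual data this would be `red.corr()["quality"]`; here is a self-contained sketch on synthetic data (the column names mimic the dataset, the values do not come from it):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one feature built to correlate with quality, one that is noise.
rng = np.random.default_rng(0)
alcohol = rng.normal(10.0, 1.0, 500)
quality = np.round(0.5 * alcohol + rng.normal(0.0, 0.7, 500)).astype(int)
df = pd.DataFrame({"alcohol": alcohol,
                   "chlorides": rng.normal(0.08, 0.02, 500),
                   "quality": quality})

# Pearson correlation of each feature with quality, strongest first (by magnitude).
corr = df.corr()["quality"].drop("quality").sort_values(key=abs, ascending=False)
print(corr)
```

Sorting by absolute value matters because a strong negative correlation is just as informative as a positive one.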
# Same exploration for white wine
white.head()      # all features are numerical
white.describe()  # summary statistics
white.info()      # no missing values
sns.pairplot(white)
Data Pre-processing: The previous step should give you a better understanding of which pre-processing is required for the data. This may include:
Note that, as the name implies, this is a warm-up challenge, which essentially means that data is already put in a convenient format that requires minimal pre-processing.
# Min-max normalisation of all features.
# Note: fitting the scaler on the full dataset before the train/test split
# leaks test-set statistics into training; a stricter approach would fit the
# scaler on the training set only.
min_max_scaler = preprocessing.MinMaxScaler()
red_normalized = pd.DataFrame(min_max_scaler.fit_transform(red.values), columns=red.columns, index=red.index)
min_max_scaler = preprocessing.MinMaxScaler()
white_normalized = pd.DataFrame(min_max_scaler.fit_transform(white.values), columns=white.columns, index=white.index)
red_y = red['quality']
red_X = red_normalized.drop(['quality'], axis=1)
white_y = white['quality']
white_X = white_normalized.drop(['quality'], axis=1)
Feature selection does not seem necessary here, nor does outlier removal. Combining features is an option worth considering, though with only eleven features there is little obvious redundancy to exploit.
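If feature combinations were explored, one simple candidate would be the ratio of free to total sulfur dioxide. This is a hypothetical engineered feature (not used below), sketched here on illustrative values:

```python
import pandas as pd

# Illustrative values, not taken from the dataset.
df = pd.DataFrame({"free sulfur dioxide": [11.0, 25.0],
                   "total sulfur dioxide": [34.0, 67.0]})

# Hypothetical engineered feature: fraction of the sulfur dioxide that is free.
df["free SO2 ratio"] = df["free sulfur dioxide"] / df["total sulfur dioxide"]
print(df["free SO2 ratio"].round(3).tolist())  # [0.324, 0.373]
```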
An important part of the work involves the selection of a model that can successfully handle the given data and yield sensible predictions. Instead of focusing exclusively on your final chosen model, it is also important to share your thought process in this notebook by additionally describing alternative candidate models. There is a wealth of models to choose from, such as decision trees, random forests, (Bayesian) neural networks, Gaussian processes, LASSO regression, and so on.
Irrespective of your choice, it is highly likely that your model will have one or more parameters that require tuning. There are several techniques for carrying out such a procedure, such as cross-validation.
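As a minimal sketch of cross-validation for tuning (synthetic data; the candidate `max_depth` values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with the same number of features as the wine sets.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)

# Score each candidate depth with 5-fold cross-validation on the training data only.
scores = {d: cross_val_score(RandomForestClassifier(max_depth=d, random_state=0),
                             X, y, cv=5).mean()
          for d in (2, 4, 8)}
best_depth = max(scores, key=scores.get)
print("best max_depth:", best_depth)
```

`GridSearchCV` automates exactly this loop over a parameter grid, as done further below.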
### Model selection
We first chose regression models, since the following plot suggests a gradient in quality that is close to linear:
# Volatile acidity (column 1) vs. alcohol (column 10), coloured by quality
plt.scatter(red_X.to_numpy()[:, 1], red_X.to_numpy()[:, 10], marker='o', c=red_y, s=25, alpha=0.8)
This is why, even though the task looks like a classification problem, regression models can nonetheless be relevant.
Data count:
values, count = np.unique(red_y, return_counts=True)
print(values, count)
values, count = np.unique(white_y, return_counts=True)
print(values, count)
# Separation of test set and training set
Xr_train, Xr_test, yr_train, yr_test = train_test_split(red_X, red_y, test_size=0.2, random_state=0)
Xw_train, Xw_test, yw_train, yw_test = train_test_split(white_X, white_y, test_size=0.2, random_state=0)
Since MAE or MSE scores cannot be compared between classifiers and regressors, we cross-validate every model on the same metric: classification accuracy, rounding regressor outputs to the nearest integer class.
The plot just above suggests that even a simple linear regression would not perform too badly, since quality varies fairly smoothly over some features (the colour gradient in the figure). Linear regression is thus our baseline, against which the more complex models tried afterwards can be evaluated.
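Since the same evaluation is repeated for every model below, it can be factored into one helper. This is a sketch: `evaluate` and its rounding convention are our own, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

def evaluate(model, X, y, regressor=False, cv=5):
    """Print cross-validated accuracy and confusion matrix.

    Regressor outputs are rounded to the nearest integer class before scoring,
    so classifiers and regressors are compared on the same footing.
    """
    ypred = cross_val_predict(model, X, y, cv=cv)
    if regressor:
        ypred = np.around(ypred).astype(np.int64)
    print("cross validation accuracy =", accuracy_score(y, ypred))
    print("confusion matrix =\n", confusion_matrix(y, ypred))
    return ypred
```

For example, `evaluate(LinearRegression(), Xr_train, yr_train, regressor=True)` would reproduce the linear-regression evaluation below.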
### Linear Regression
model = LinearRegression()
# Cross-validated predictions give a better estimate of generalisation performance
ypred = np.around(cross_val_predict(model, Xr_train, yr_train, cv=5)).astype(np.int64)
print("cross validation accuracy =", accuracy_score(yr_train, ypred))
print("confusion matrix =\n", confusion_matrix(yr_train, ypred))
Next, a random forest regressor:
### RANDOM FOREST
model = RandomForestRegressor(verbose = 0)
# Cross-validated predictions give a better estimate of generalisation performance
ypred = np.around(cross_val_predict(model, Xr_train, yr_train, cv=5)).astype(np.int64)
print("cross validation accuracy =", accuracy_score(yr_train, ypred))
print("confusion matrix =\n", confusion_matrix(yr_train, ypred))
Then, a gradient boosting regressor:
### GRADIENT BOOSTING REGRESSOR
model = GradientBoostingRegressor(verbose = 0)
# Cross-validated predictions give a better estimate of generalisation performance
ypred = np.around(cross_val_predict(model, Xr_train, yr_train, cv=5)).astype(np.int64)
print("cross validation accuracy =", accuracy_score(yr_train, ypred))
print("confusion matrix =\n", confusion_matrix(yr_train, ypred))
Finally among the regressors, a support vector regressor:
### SVM REGRESSOR
model = SVR()
# Cross-validated predictions give a better estimate of generalisation performance
ypred = np.around(cross_val_predict(model, Xr_train, yr_train, cv=5)).astype(np.int64)
print("cross validation accuracy =", accuracy_score(yr_train, ypred))
print("confusion matrix =\n", confusion_matrix(yr_train, ypred))
Perhaps regression models are not as relevant as we first thought, so we also evaluate classification models. We start with logistic regression, which, like linear regression among the regressors, is probably the simplest classifier.
### Logistic Regression
model = LogisticRegression(max_iter=1000)  # raise the iteration cap so the solver converges
# Cross-validated predictions give a better estimate of generalisation performance
ypred = cross_val_predict(model, Xr_train, yr_train, cv=5)
print("cross validation accuracy =", accuracy_score(yr_train, ypred))
print("confusion matrix =\n", confusion_matrix(yr_train, ypred))
Next, a random forest classifier:
### Random forest classification
model = RandomForestClassifier()
# Cross-validated predictions give a better estimate of generalisation performance
ypred = cross_val_predict(model, Xr_train, yr_train, cv=5)
print("cross validation accuracy =", accuracy_score(yr_train, ypred))
print("confusion matrix =\n", confusion_matrix(yr_train, ypred))
Finally, a support vector classifier:
### SVM CLASSIFIER
model = SVC()
# Cross-validated predictions give a better estimate of generalisation performance
ypred = cross_val_predict(model, Xr_train, yr_train, cv=5)
print("cross validation accuracy =", accuracy_score(yr_train, ypred))
print("confusion matrix =\n", confusion_matrix(yr_train, ypred))
KNN may be a relevant classifier, though expensive at prediction time. We also need to find a good value of k, the number of neighbours taken into account.
n_neighbors = np.arange(1, 50)
accs = []
for k in n_neighbors:
    model = KNeighborsClassifier(n_neighbors=k, weights='distance')
    ypred = cross_val_predict(model, Xr_train, yr_train, cv=4)
    accs.append(accuracy_score(yr_train, ypred))
plt.plot(n_neighbors, accs)
plt.xlabel("k (number of neighbors)")
plt.ylabel("cross validation accuracy")
plt.show()
We see that k is optimal around 15, with a cross-validation accuracy of about 0.65.
The best results are obtained with the random forest classifier (even without hyperparameter tuning) and KNN (with the optimal k). To optimize the random forest model, we use grid search.
model = RandomForestClassifier(verbose=0)
model.get_params().keys()
parameters = {
"n_estimators":[10, 15, 20],
"min_samples_split": np.linspace(0.001, 0.005, 5),
"min_samples_leaf": np.linspace(0.0001, 0.001, 5),
"max_depth":[12, 8, 4],
"max_features":["log2", "sqrt"],  # "auto" is deprecated (it was an alias of "sqrt")
"criterion": ["gini", "entropy"],
}
grid_search = GridSearchCV(model, parameters, verbose = 1, n_jobs=-1)
grid_search.fit(Xr_train, yr_train.values.ravel())
print(grid_search.best_score_)  # mean cross-validated accuracy of the best parameter setting
print(grid_search.best_params_)
bestModel_r = RandomForestClassifier(n_estimators=20, min_samples_split=0.003, min_samples_leaf=0.000325, max_depth=12, max_features="log2", criterion="gini")
bestModel_r.fit(Xr_train, yr_train.values.ravel())
print("Accuracy on test set:", accuracy_score(bestModel_r.predict(Xr_test), yr_test))
parameters = {
"n_estimators":[10, 20],
"min_samples_split": np.linspace(0.0001, 0.0005, 5),
"min_samples_leaf": np.linspace(0.000001, 0.00001, 5),
"max_depth":[12, 8, 4],
"max_features":["log2", "sqrt"],  # "auto" is deprecated (it was an alias of "sqrt")
"criterion": ["gini", "entropy"],
}
grid_search = GridSearchCV(model, parameters, verbose = 1, n_jobs=-1)
grid_search.fit(Xw_train, yw_train.values.ravel())
print(grid_search.best_score_)  # mean cross-validated accuracy of the best parameter setting
print(grid_search.best_params_)
bestModel_w = RandomForestClassifier(n_estimators=20, min_samples_split=0.0002, min_samples_leaf=5.5e-06, max_depth=12, max_features="log2", criterion="entropy")
bestModel_w.fit(Xw_train, yw_train.values.ravel())
print("Accuracy on test set:", accuracy_score(bestModel_w.predict(Xw_test), yw_test))
The evaluation metric for this project is "Log Loss". For the $N$ wines in the test data set, the metric is calculated as:
$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]$
where $y_i$ is the true (but withheld) quality outcome for wine $i$ in the test data set, and $p_i$ is the predicted probability of good quality for wine $i$. Larger values of $\mathcal{L}$ indicate poorer predictions.
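Log loss applies to the binarized good/not-good formulation and needs predicted probabilities rather than hard labels (in practice, from a classifier's `predict_proba`). A sketch on toy values, checked against scikit-learn's `log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy binary labels ('good' = 1) and predicted probabilities of the positive class;
# in practice p would come from a fitted classifier's predict_proba.
y_true = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4])

# Direct implementation of the formula above; scikit-learn's log_loss matches it.
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(round(manual, 4), round(log_loss(y_true, p), 4))  # 0.3375 0.3375
```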
yr_predict = bestModel_r.predict(Xr_test)
yw_predict = bestModel_w.predict(Xw_test)
def accuracy(y, y_hat):
    return np.sum(np.abs(y - y_hat) < 0.5) / y.size

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))
Red wine classification
print("accuracy =", accuracy_score(yr_test, yr_predict))
print("mae =", mae(yr_test.to_numpy().astype(np.float64), yr_predict))
print("rmse =", rmse(yr_test.to_numpy().astype(np.float64), yr_predict))
White wine classification
print("accuracy =", accuracy_score(yw_test, yw_predict))
print("mae =", mae(yw_test.to_numpy().astype(np.float64), yw_predict))
print("rmse =", rmse(yw_test.to_numpy().astype(np.float64), yw_predict))
$\mathrm{MAE} < \mathrm{RMSE}$ here because the classes are integers: whenever a prediction is wrong, the absolute error is at least 1, and squaring amplifies these errors, so RMSE weighs the larger mistakes more heavily.
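A tiny numeric check of this inequality (the values are illustrative):

```python
import numpy as np

# Integer quality predictions: whenever one is wrong, the absolute error is >= 1.
y = np.array([5, 6, 7, 5])
y_hat = np.array([5, 5, 5, 5])

mae = np.mean(np.abs(y - y_hat))             # (0 + 1 + 2 + 0) / 4 = 0.75
rmse = np.sqrt(np.mean((y - y_hat) ** 2))    # sqrt((0 + 1 + 4 + 0) / 4) ≈ 1.118
print(mae, rmse)
```

The single error of 2 contributes 4 to the squared sum but only 2 to the absolute sum, which is exactly why RMSE exceeds MAE.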